1 Introduction

This repository contains the data and results to reproduce the paper by Nguyen et al., submitted to Water Resources Research. This vignette shows the step-by-step computations to reproduce the paper’s results and figures. We also provide additional details that may be of interest to some readers.

We provide the data in the folder data/. Unfortunately, we cannot share the instrumental streamflow data for the Yangtze, Mekong, and Pearl Rivers due to restrictions. Therefore, if you run the code here, you will not get results for these rivers in the computations that require instrumental data. However, we provide all results, including those for these rivers, in the folder results/.

R scripts that contain the intermediate computational steps and utility functions are provided in the folder R/. You may want to check them out before we proceed.

We start by loading the necessary packages and utility functions (if you don’t already have some packages, please install them first).

2 Data

Let’s read the main data.

Now let’s explore the data sets.

2.1 Streamflow data

2.1.1 Metadata

Table S2.

Note. In the paper, we omitted the leading zeros in station IDs due to space constraints. Here we are using the full station IDs. The column code is an encoding for plotting, which will be useful in Figure 5.

3 Climate-informed input variable selection

The main idea is to select MADA grid points that are in a similar climate to the streamflow station of interest. We characterize climate using the KWF hydroclimate system.

3.1 KWF hydroclimate system

We provide the hydroclimate classification system in the file data/kwf.RDS, only for the Monsoon Asia domain.

long and lat are the coordinates of the grid point. x1, x2, y1, y2 are the four corners of the grid cell, which will be used to determine the cell that contains any point on Earth. arid, seas, and snow are the three KWF indices: aridity, seasonality, and snow fraction. col is the RGB colour created from the three indices.
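To illustrate how the corner columns can be used, here is a minimal base-R sketch of the cell lookup. The cell table, its values, and the bounds convention (closed lower edge, open upper edge) are made up for illustration; the repository's utility functions in R/ do the real work:

```r
# Toy KWF cell table with made-up corner coordinates and aridity values
kwf <- data.frame(x1 = c(100, 100.5), x2 = c(100.5, 101),
                  y1 = c(15, 15),     y2 = c(15.5, 15.5),
                  arid = c(0.3, 0.6))

# Return the row of the cell containing the point (long, lat);
# the closed-lower / open-upper convention here is an assumption
cell_of <- function(long, lat, cells) {
  cells[long >= cells$x1 & long < cells$x2 &
        lat  >= cells$y1 & lat  < cells$y2, ]
}

cell_of(100.7, 15.2, kwf)$arid  # the point falls in the second cell
```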

Let’s visualize this data set.

With this, we can identify the KWF cell of each MADA grid cell, so as to determine its climate. The MADA v2 has a resolution of \(1^\circ \times 1^\circ\) and the KWF has a resolution of \(0.5^\circ \times 0.5^\circ\). The MADA grid lies nicely on the KWF grid, so all we need is a simple left join.
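As a toy illustration of the join (coordinates and values are made up; the actual code joins the real grids, most likely with data.table's join syntax), a left join keeps every MADA point and attaches the KWF indices of the matching cell:

```r
# Made-up MADA points and KWF cells, matched on grid-cell centre coordinates
mada <- data.frame(long = c(100.25, 101.25), lat = c(15.25, 15.25))
kwf  <- data.frame(long = c(100.25, 101.25, 102.25),
                   lat  = c(15.25, 15.25, 15.25),
                   arid = c(0.3, 0.5, 0.7))

# Left join: all MADA points are kept; unmatched KWF cells are dropped
joined <- merge(mada, kwf, by = c('long', 'lat'), all.x = TRUE)
nrow(joined)  # 2
```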

3.2 Select MADA grid points

From here on, we’ll need to use lots of parallel computing, so let’s set this up.
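The foreach() loops below need a registered parallel backend. A typical setup looks like this (the core count is a suggestion, not the paper's configuration):

```r
library(doParallel)

# Register a parallel backend for the %dopar% loops below,
# leaving one core free so the machine stays responsive
nCores <- max(parallel::detectCores() - 1L, 1L)
cl <- makeCluster(nCores)
registerDoParallel(cl)
getDoParWorkers()  # number of workers now serving %dopar%
# stopCluster(cl)  # run this once all computations are done
```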

Now we can select MADA grid points based on the KWF distance, using get_mada_by_kwf(). This will take a few seconds on a normal desktop.

Alternatively, you can read the pre-computed results.

Let’s now calculate the correlation between streamflow and the MADA for the selected stations on the Krishna and Chao Phraya (as in the paper), so that we can compare the significantly correlated areas with the search area in Figure 3. The file R/correlation_functions.R contains a utility function to determine the boundary lines of the significant areas.

We’re now ready to reproduce Figure 3.

selectedPointPlot <- ggplot(inputPoints, aes(long, lat)) +
  geom_tile(data = mada2xy, width = 1, height = 1, fill = 'gray') +
  geom_tile(fill = '#4daf4a', width = 1, height = 1) +
  geom_point(data = sxy, colour = 'red') +
  scale_x_continuous(expand = c(0, 0), labels = pasteLong) +
  scale_y_continuous(expand = c(0, 0), position = 'right') +
  coord_quickmap() +
  facet_grid(name ~ kwf, switch = 'y',
             labeller = labeller(kwf = label_parsed)) +
  theme_bw() +
  theme(axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.line = element_blank(),
        axis.title = element_blank(),
        strip.background = element_blank(),
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = 'Selected MADA grid points')

corPlot <- ggplot(subCor) +
  geom_raster(aes(long, lat, fill = rho)) +
  geom_segment(aes(x = long - 0.5, xend = long + 0.5, y = lat + 0.5, yend = lat + 0.5),
               size = 0.1,
               data = subCorSignif[top == TRUE]) +
  geom_segment(aes(x = long - 0.5, xend = long + 0.5, y = lat - 0.5, yend = lat - 0.5),
               size = 0.1,
               data = subCorSignif[bottom == TRUE]) +
  geom_segment(aes(x = long - 0.5, xend = long - 0.5, y = lat - 0.5, yend = lat + 0.5),
               size = 0.1,
               data = subCorSignif[left == TRUE]) +
  geom_segment(aes(x = long + 0.5, xend = long + 0.5, y = lat - 0.5, yend = lat + 0.5),
               size = 0.1,
               data = subCorSignif[right == TRUE]) +
  geom_point(aes(Qlong, Qlat), colour = 'red') +
  scale_x_continuous(expand = c(0, 0), labels = pasteLong) +
  scale_y_continuous(expand = c(0, 0), labels = pasteLat) +
  scale_fill_distiller(name = 'Correlation', palette = 'RdBu', direction = 1, 
                       breaks = scales::pretty_breaks(3), limits = absRange(subCor$rho)) +
  coord_quickmap() +
  labs(x = NULL, y = NULL) +
  facet_wrap(~name, ncol = 1, strip.position = 'right') +
  theme_bw() +
  theme(panel.grid = element_blank(),
        axis.line = element_blank(),
        legend.position = 'top',
        legend.key.width = unit(0.6, 'cm'),
        strip.background = element_blank())

s2 <- plot_grid(selectedPointPlot, corPlot, 
                ncol = 2, align = 'hv', axis = 'trbl', rel_widths = c(1.75, 1),
                labels = c('a)', 'b)'), label_size = 10)

s2a <- ggdraw(s2) +
  draw_label('(Krishna)',     0.62, 0.650, size = 10, angle = 90) +
  draw_label('(Chao Phraya)', 0.62, 0.250, size = 10, angle = 90)
s2a

3.3 Weighted Principal Component Analysis

We need to calculate the correlation between each grid cell and each station. The result is a correlation matrix in which each row is a station and each column is a MADA grid point.
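In R, cor() applied to two matrices returns exactly this cross-correlation matrix. A toy sketch with random data (the dimensions are made up):

```r
set.seed(1)
# 50 years of flow at 3 stations and of tree-ring indices at 4 grid points
flow  <- matrix(rnorm(50 * 3), ncol = 3,
                dimnames = list(NULL, paste0('station', 1:3)))
rings <- matrix(rnorm(50 * 4), ncol = 4,
                dimnames = list(NULL, paste0('grid', 1:4)))

# One row per station, one column per grid point
corMat <- cor(flow, rings)
dim(corMat)  # 3 4
```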

We also need to set up the p parameter and make some small preparations.

The previous step gives us a 30-member ensemble from 5 KWF distances and 6 PCA weights. You can type View(pca) to see the results. Each ensemble member contains a set of weighted PCs.
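As a generic sketch of what a weighted PCA looks like (the actual weighting scheme is implemented in the repository's R/ scripts; the weights and the role of p below are illustrative assumptions, not the paper's exact formulation):

```r
set.seed(42)
X <- matrix(rnorm(100 * 10), ncol = 10)  # 100 years x 10 grid points (toy data)
w <- runif(10)                           # per-point weights, e.g. from correlations
p <- 2                                   # exponent controlling weight contrast

# Standardize, scale each column by its weight, then run an ordinary PCA
Xw  <- sweep(scale(X), 2, w^p, `*`)
pcs <- prcomp(Xw, center = FALSE)$x      # the weighted principal components
dim(pcs)  # 100 10
```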

Next, we use the VSURF algorithm to select a subset of principal components for each ensemble member. This code takes about 20 minutes, so get yourself a drink while it runs. Or skip to the next chunk and read the pre-computed results.

ivs contains the names of the selected PCs. Type View(ivs) to see its content.

Finally, we combine pca and ivs to have an ensemble of selected PCs.

If you skipped the computations above, you can read the pre-computed weighted PCA ensembles here.

4 Cross-validation

This cross-validation step is also a tuning step: we cross-validate all ensemble members and select the one with the highest mean KGE.
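The selection itself is a simple group-wise maximum. A base-R toy version (the table and its values are made up):

```r
# One mean KGE per (station, ensemble member)
cvScores <- data.frame(ID     = rep(c('A', 'B'), each = 3),
                       member = rep(1:3, times = 2),
                       KGE    = c(0.4, 0.7, 0.6, 0.5, 0.3, 0.8))

# For each station, keep the member with the highest KGE
best <- do.call(rbind, lapply(split(cvScores, cvScores$ID),
                              function(d) d[which.max(d$KGE), ]))
best$member  # 2 3
```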

We will also check ensemble-averaged models, where we average over all p, over all KWF distances, and over both p and KWF.

Before we start, we need to make the cross-validation folds.
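A generic way to build k folds over the instrumental years looks like the following (this randomized split is an assumption for illustration; the actual fold design is in the vignette's code):

```r
set.seed(100)
nYears <- 60   # length of a toy instrumental record
k <- 5         # number of folds (illustrative)

# Randomly assign each year index to one of k roughly equal folds
folds <- split(sample(nYears), rep(1:k, length.out = nYears))
lengths(folds)  # 12 12 12 12 12
```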

We also need to determine whether streamflow needs to be log-transformed for each station.

Running the above line of code now only gives you results for 30 stations, so we provide the pre-computed results.
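For intuition, one common heuristic is to log-transform strongly right-skewed flows. This skewness rule and its threshold are an illustration, not necessarily the paper's criterion:

```r
# Log-transform when the sample skewness is large (threshold is arbitrary here)
decide_trans <- function(Q) {
  skew <- mean((Q - mean(Q))^3) / sd(Q)^3
  if (skew > 1) 'log' else 'none'
}

decide_trans(c(rep(1, 99), 1000))  # one huge outlier -> 'log'
decide_trans(1:10)                 # symmetric -> 'none'
```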

This code takes many hours to run, so you should only run it on a server (one with 24–32 cores or so) or on a cluster.

ldsScores <- 
  foreach(s = stationIDs,
          .packages = c('data.table', 'ldsr'),
          .final = function(x) {
            names(x) <- stationIDs
            rbindlist(x, idcol = 'ID')
          }) %dopar% {
      Qa <- instQ[s]
      Z <- cvPoints[[s]]
      transform <- trans[s, trans]
      obs <- if (transform == 'log') log(Qa$Qa) else Qa$Qa
      mu <- mean(obs, na.rm = TRUE)
      instPeriod <- which(1200:2012 %in% Qa$year)
      y <- t(c(rep(NA, Qa[1, year] - 1200),   # Before the instrumental period
               obs - mu,                      # Instrumental period
               rep(NA, 2012 - Qa[.N, year]))) # After the instrumental period
      # Individual models
      Ycv <- lapplyrbind(kwfNames, function(kwfMax) 
               lapplyrbind(pNames, function(p) {
                 u <- t(ensemblePCs[[s]][[kwfMax]][[p]])
                 lapplyrbind(Z, function(z) 
                   data.table(year = Qa$year,
                              Q = one_lds_cv(z, instPeriod, 
                                             mu, y, u, u, 'EM', num.restarts))) 
              }))
      colnames(Ycv)[1:3] <- c('kwf', 'p', 'rep')
      # Different ensembles
      ensP <- Ycv[, .(p = 'Ensemble', Q = mean(Q)), by = .(kwf, rep, year)]
      ensKWF <- Ycv[, .(kwf = 'Ensemble', Q = mean(Q)), by = .(p, rep, year)]
      ens <- Ycv[, .(p = 'Ensemble', kwf = 'Ensemble', Q = mean(Q)), by = .(rep, year)]
      # Merge & calculate scores
      Ycv <- rbind(Ycv, ensP, ensKWF, ens)
      scores <- Ycv[, 
                    {
                      Q <- if (transform == 'log') exp(Q) else Q
                      metrics <- calculate_metrics(Q, Qa$Qa, Z[[rep]])
                      data.table(metric = names(metrics), value = metrics)
                    },
                    by = .(kwf, p, rep)
                  ][, .(value = mean(value)), by = .(kwf, p, metric)]
      # Return only the scores; Ycv itself is not needed downstream
      scores
    }
ldsScores[, type := fifelse(kwf == 'Ensemble' & p == 'Ensemble', 
                                    'both',
                                    fifelse(kwf == 'Ensemble' | p == 'Ensemble',
                                            'one', 
                                            'none'))]
ldsScores[, p := factor(p, levels = c(pNames, 'Ensemble'))]

Let’s plot the distribution of KGE.

Averaging over the ensemble does not improve the scores significantly, and it breaks the state-streamflow relationship. So we use the best member instead of the ensemble average.

Let’s now select the best ensemble member and check the KGE value of the selected ones.

Next, we do the following steps:

  • Merge this selection with the full ensemble scores to get all the skill scores of the selected members.
  • Keep only the relevant columns.
  • Convert to wide format.
  • Merge with the metadata table to get the coordinates.
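The four steps above map to a short data.table pipeline. A toy version (all table and column names below are stand-ins for the real objects):

```r
library(data.table)
set.seed(1)

# Stand-ins for the real objects
selected  <- data.table(ID = c('A', 'B'), member = c(1L, 2L))
allScores <- CJ(ID = c('A', 'B'), member = 1:2, metric = c('CE', 'KGE'))[
  , value := runif(.N)]
meta      <- data.table(ID = c('A', 'B'), long = c(100, 105), lat = c(15, 12))

scores <- allScores[selected, on = c('ID', 'member')]  # 1. scores of selected members
scores <- scores[, .(ID, metric, value)]               # 2. keep relevant columns
scores <- dcast(scores, ID ~ metric)                   # 3. wide format
scores <- scores[meta, on = 'ID']                      # 4. attach coordinates
```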

We are now ready to plot the skill score distribution (Figure 4). The function to do this is provided in R/plot_metric_map.R.

source('R/plot_metric_map.R')
g1 <- metric_map(scores, 'RE', numClasses = 9, bgMap = bgMap, 
                 dotSize = 1,
                 histPosition = 'bottom', histBarDirection = 'vertical')
g2 <- metric_map(scores, 'CE', numClasses = c(1, 9), bgMap = bgMap,
                 dotSize = 1,
                 histPosition = 'bottom', histBarDirection = 'vertical')
g3 <- metric_map(scores, 'KGE', numClasses = 9, bgMap = bgMap, 
                 dotSize = 1,
                 histPosition = 'bottom', histBarDirection = 'vertical')
p1 <- g1[[1]] + theme(plot.tag.position = c(0.16, 0.98),
                      plot.margin = margin(t = 0.1, r = 0.1, b = 0.1, l = 0.1, unit = 'cm'))
p2 <- g1[[2]] + theme(plot.tag.position = c(0.16, 1.15),
                      plot.margin = margin(t = 1, r = 0.1, b = 0.1, l = 0.1, unit = 'cm'))
p3 <- g2[[1]] + theme(plot.tag.position = c(-0.03, 0.98),
                      axis.text.y = element_blank())
p4 <- g2[[2]] + theme(plot.tag.position = c(-0.03, 1.15),
                      plot.margin = margin(t = 1, r = 0.1, b = 0.1, l = 0.5, unit = 'cm'),
                      axis.text.y = element_blank(), 
                      axis.title.y = element_blank())
p5 <- g3[[1]] + theme(plot.tag.position = c(-0.03, 0.98),
                      plot.margin = margin(t = 0.1, r = 0.1, b = 0.1, l = 0.5, unit = 'cm'),
                      axis.text.y = element_blank())
p6 <- g3[[2]] + theme(plot.tag.position = c(-0.03, 1.15),
                      plot.margin = margin(t = 1, r = 0.1, b = 0.1, l = 0.5, unit = 'cm'),
                      axis.text.y = element_blank(), 
                      axis.title.y = element_blank())
scorePlot <- p1 + p2 + p3 + p4 + p5 + p6 +
  plot_layout(byrow = FALSE, ncol = 3, heights = c(1, 0.65)) +
  plot_annotation(tag_levels = 'a', tag_suffix = ')') &
  theme(plot.tag = element_text(size = 12, face = 'bold'))
scorePlot

5 Streamflow history

We now build the full reconstruction using the selected models. This code takes about 30 seconds and will give you reconstruction results for the stations with available instrumental data. To get results for all stations, skip to the next chunk and read the pre-computed results.

We also provide the full 30-member ensemble results in the file results/ldsFitFull.RDS, so you can inspect other models as well.

Now let’s plot the reconstructed flow history (Figure 5). The plot function is provided in R/flow_history.R. We need to normalize the reconstructed flow with the observed mean and standard deviation; these are provided in the file data/instQsummary.csv.
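The normalization is a plain standardization against the instrumental record (the numbers below are made up; the real values come from data/instQsummary.csv):

```r
Qrec    <- c(1200, 950, 1100)  # reconstructed annual flows (made-up values)
obsMean <- 1000                # observed mean (from data/instQsummary.csv in the real code)
obsSD   <- 150                 # observed standard deviation, same file

Qnorm <- (Qrec - obsMean) / obsSD
round(Qnorm, 2)  # 1.33 -0.33 0.67
```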

fhAnnotated <- fhWrap +
  draw_line(c(0.0755, 0.1392), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.0995, 0.1408), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.1195, 0.2375), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.3700, 0.2705), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.3900, 0.3000), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.5925, 0.3262), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.6122, 0.3565), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.6361, 0.3575), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.6575, 0.5637), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.6889, 0.5666), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.7100, 0.6950), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.8145, 0.7082), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.8353, 0.7325), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.8915, 0.7392), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.9120, 0.7608), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.9358, 0.7615), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.9565, 0.8285), yLine, colour = 'steelblue', size = 0.2) +
  draw_line(c(0.9805, 0.8305), yLine, colour = 'steelblue', size = 0.2) +
  draw_text(paste0('(', 1:9, ')'),
            c(0.090, 0.244, 0.490, 0.623, 0.674, 0.767, 0.863, 0.921, 0.965),
            yText, size = 9) +
  draw_text(c('a)', 'b)'), 0.03, c(0.95, 0.4), size = 11, fontface = 'bold')
fhAnnotated

Zoom in to the instrumental period (Figure S5a).

6 SST correlation

6.1 Overall correlations (1855–2012)

Figure 6.

6.2 Breaking down into sub-periods

Figure S6, but without instrumental data for the Mekong and Yangtze.

# Instrumental data
outletInst <- instQ[outletLookup][year %in% 1955:2004] %>% split(by = 'ID')

sstCorInst <- 
  foreach(Q = outletInst, .combine = rbind, .packages = 'data.table') %:%
    foreach(case = combi, .combine = rbind) %dopar%
        get_sst_cor(Q, case$season, case$lag, 'Qa')

# Set factors to determine plot order
sstCorInst[season == 'DJF (-1)', season := 'DJF (-2)']
sstCorInst[season == 'DJF'     , season := 'DJF (-1)']
sstCorInst[season == 'DJF (+1)', season := 'DJF']
sstCorInst$season %<>% factor(levels = c('JJA (-1)', 'SON (-1)', 'DJF (-1)', 
                                     'MAM',      'JJA',      'SON',
                                     'DJF'))
sstCorInst$river %<>% factor(levels = c('Krishna', 'Chao Phraya', 'Mekong', 'Yangtze'))
sstCorInst[, signif := p.value < 0.05]
setkey(sstCorInst, long, lat)
sstCorInst <- sstCorInst[sstxy, on = c('long', 'lat')]
sstSignifAreaInst <- sstCorInst[, signif_area(.SD, 2, 2), by = .(season, river)]

# Reconstruction, breaking into periods
sstCor <- 
  foreach(Q = outlet, .combine = rbind, .packages = 'data.table') %:%
    foreach(case = combi, .combine = rbind) %:%
      foreach(startYear = c(1855, 1905, 1955), .combine = rbind) %dopar% {
        ans <- get_sst_cor(Q[between(year, startYear, startYear + 49)], 
                           case$season, 
                           case$lag,
                           'Q')
        ans[, period := startYear][]
      }
# Set factors to determine plot order
sstCor[season == 'DJF (-1)', season := 'DJF (-2)']
sstCor[season == 'DJF'     , season := 'DJF (-1)']
sstCor[season == 'DJF (+1)', season := 'DJF']
sstCor$season %<>% factor(levels = c('JJA (-1)', 'SON (-1)', 'DJF (-1)', 
                                     'MAM',      'JJA',      'SON',
                                     'DJF'))
sstCor$river %<>% factor(levels = c('Krishna', 'Chao Phraya', 'Mekong', 'Yangtze'))
sstCor[, signif := p.value < 0.05]
setkey(sstCor, long, lat)
sstCor <- sstCor[sstxy, on = c('long', 'lat')]
sstSignifArea <- sstCor[, signif_area(.SD, 2, 2), by = .(season, river, period)]
periodLab <- function(x, dx = 49) paste0('Reconstruction, ', x, '\u2013', x + dx)
limits <- absRange(sstCor$cor)

sstCorPlot <- 
  plot_sst_cor(sstCorInst, sstSignifAreaInst, sstLand, limits = limits) + 
  labs(title = 'Instrumental streamflow, 1955–2004') +
  theme(legend.position = 'bottom', 
        legend.key.width = unit(2, 'cm'),
        strip.text.y = element_text(size = 9),
        strip.text.x = element_text(size = 11),
        strip.background.x = element_blank(),
        axis.line = element_blank())
legend <- get_legend(sstCorPlot)
# Add empty space to the right to replace Mekong and Yangtze
sstCorPlot2 <- plot_grid(sstCorPlot + theme(legend.position = 'none'), NULL, ncol = 2)

instCorPlot <- ggdraw(sstCorPlot2 + 
                        theme(plot.margin = margin(l = 0.7, r = 0.2, unit = 'cm'))) +
  draw_label('Decaying ENSO', x = 0.02, y = 0.75, angle = 90, size = 10) +
  draw_label('Ongoing ENSO', x = 0.02, y = 0.3, angle = 90, size = 10)

pl <- lapply(c(1855, 1905, 1955), function(x) {
  p <- plot_sst_cor(sstCor[period == x], sstSignifArea[period == x], sstLand, 
                    limits = limits) +
    labs(title = periodLab(x)) +
    theme(strip.text.y = element_text(size = 9),
          strip.text.x = element_text(size = 11),
          strip.background.x = element_blank(),
          axis.line = element_blank())
  ggdraw(p +
           theme(legend.position = 'none',
                 plot.margin = margin(l = 0.7, r = 0.2, unit = 'cm'))) +
    draw_label('Decaying ENSO', x = 0.02, y = 0.75, angle = 90, size = 10) +
    draw_label('Ongoing ENSO', x = 0.02, y = 0.3, angle = 90, size = 10)
})

plots <- plot_grid(pl[[1]], pl[[2]], pl[[3]], instCorPlot, 
                   ncol = 2, labels = paste0(letters[1:4], ')'))
legend_plot <- plot_grid(NULL, legend, rel_widths = c(0.5, 1))
plot_grid(plots, legend_plot, nrow = 2, rel_heights = c(15, 1))